Supplementary Material for DreamHuman: Animatable 3D Avatars from Text
This document contains additional details and experiments that did not fit in the main text due to space constraints. For animations and additional results, please also check the included videos. We use an optimization strategy similar to DreamFusion's, so unless otherwise noted the hyperparameters remain the same. Like DreamFusion, we train on a TPUv4 machine with 4 chips. We increase the number of optimization iterations from 15,000 to 50,000; we did not observe any significant benefit from training for more iterations.
Score Distillation via Reparametrized DDIM
Lukoianov, Artem, Borde, Haitz Sáez de Ocáriz, Greenewald, Kristjan, Guizilini, Vitor Campagnolo, Bagautdinov, Timur, Sitzmann, Vincent, Solomon, Justin
While 2D diffusion models generate realistic, high-detail images, 3D shape generation methods like Score Distillation Sampling (SDS) built on these 2D diffusion models produce cartoon-like, over-smoothed shapes. To help explain this discrepancy, we show that the image guidance used in Score Distillation can be understood as the velocity field of a 2D denoising generative process, up to the choice of a noise term. In particular, after a change of variables, SDS resembles a high-variance version of Denoising Diffusion Implicit Models (DDIM) with a differently-sampled noise term: SDS introduces noise i.i.d. randomly at each step, while DDIM infers it from the previous noise predictions. This excessive variance can lead to over-smoothing and unrealistic outputs. We show that a better noise approximation can be recovered by inverting DDIM in each SDS update step. This modification makes SDS's generative process for 2D images almost identical to DDIM. In 3D, it removes over-smoothing, preserves higher-frequency detail, and brings the generation quality closer to that of 2D samplers. Experimentally, our method achieves better or similar 3D generation quality compared to other state-of-the-art Score Distillation methods, all without training additional neural networks or requiring multi-view supervision, while providing useful insights into the relationship between 2D and 3D asset generation with diffusion models.
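The variance argument above can be illustrated with a small numeric sketch. This is a toy 1-D stand-in, not the paper's actual models: for Gaussian data the optimal denoiser is known in closed form, so we can compare an SDS-style update that resamples its noise term i.i.d. at every step against the noise-free expected update it approximates (standing in for the lower-variance noise recovered by DDIM inversion).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setting (illustrative assumption): the "render" is one scalar x,
# data follow x0 ~ N(mu, 1), and noising is x_t = x0 + sigma * eps, so the
# optimal noise prediction E[eps | x_t] has a closed form.
mu, sigma, lr = 3.0, 0.8, 0.05

def eps_hat(x_t):
    # Optimal noise prediction E[eps | x_t] for the Gaussian toy model.
    return sigma * (x_t - mu) / (1.0 + sigma**2)

def sds_grad(x):
    # SDS: fresh i.i.d. noise drawn at every optimization step.
    eps = rng.standard_normal()
    return eps_hat(x + sigma * eps) - eps

def expected_grad(x):
    # The same update with the noise term integrated out -- a stand-in for
    # the lower-variance noise an inverted DDIM step would recover.
    return sigma * (x - mu) / (1.0 + sigma**2)

x_sds = x_det = 0.0
tail = []
for i in range(2000):
    x_sds -= lr * sds_grad(x_sds)
    x_det -= lr * expected_grad(x_det)
    if i >= 1500:
        tail.append(x_sds)
# x_det settles at mu; x_sds keeps jittering around mu with visible spread.
```

Both estimators share the same mean direction, so the difference shows up purely as residual variance around the optimum, mirroring the over-smoothing diagnosis in the abstract.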
Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model
Li, Xiaolong, Mo, Jiawei, Wang, Ying, Parameshwara, Chethan, Fei, Xiaohan, Swaminathan, Ashwin, Taylor, CJ, Tu, Zhuowen, Favaro, Paolo, Soatto, Stefano
In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have been shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts, and may entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the need to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D generation from the same text prompt.
Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior
Wu, Zike, Zhou, Pan, Yi, Xuanyu, Yuan, Xiaoding, Zhang, Hanwang
Score distillation sampling (SDS) and its variants have greatly boosted the development of text-to-3D generation, but remain vulnerable to geometry collapse and poor textures. To solve this issue, we first analyze SDS in depth and find that its distillation sampling process in fact corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample, which then serves as guidance to optimize a 3D model. However, the randomness in SDE sampling often leads to a diverse and unpredictable sample that is not always less noisy, and thus is not consistently correct guidance, explaining the vulnerability of SDS. Since for any SDE there always exists an ordinary differential equation (ODE) whose trajectory sampling deterministically and consistently converges to the same target point as the SDE, we propose a novel and effective "Consistent3D" method that exploits this deterministic ODE sampling prior for text-to-3D generation. Specifically, at each training iteration, given an image rendered by a 3D model, we first estimate its desired 3D score function with a pre-trained 2D diffusion model and build an ODE for trajectory sampling. Next, we design a consistency distillation sampling loss that samples two adjacent points along the ODE trajectory and uses the less noisy sample to guide the noisier one, distilling the deterministic prior into the 3D model. Experimental results show the efficacy of our Consistent3D in generating high-fidelity and diverse 3D objects and large-scale scenes, as shown in Fig. 1. The code is available at https://github.com/sail-sg/Consistent3D.
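The SDE-versus-ODE contrast can be made concrete in a toy 1-D model (an illustrative assumption, not the paper's setup): for Gaussian data the probability-flow ODE has a closed-form endpoint, so deterministic Euler sampling can be checked to converge consistently to the predicted target, with no sampling randomness involved.

```python
import numpy as np

# Toy model (assumption for illustration): data x0 ~ N(mu, 1) with noise
# schedule sigma(t) = t gives marginals p_t = N(mu, 1 + t^2), score
# -(x - mu)/(1 + t^2), and probability-flow ODE
#   dx/dt = -sigma'(t) * sigma(t) * score = t * (x - mu) / (1 + t^2).
mu = 2.0

def ode_velocity(x, t):
    return t * (x - mu) / (1.0 + t * t)

def sample_ode(x_T, T=10.0, n_steps=1000):
    # Deterministic Euler sampling from t = T down to t = 0. Adjacent
    # pairs (x_t, x_{t-h}) along this trajectory are the kind of samples
    # the consistency distillation loss compares.
    h = T / n_steps
    x, t = x_T, T
    for _ in range(n_steps):
        x -= h * ode_velocity(x, t)
        t -= h
    return x

# Closed form: x(0) = mu + (x_T - mu) / sqrt(1 + T^2), so the ODE maps
# each x_T to a single consistent target point.
x0 = sample_ode(12.0)
```

Rerunning `sample_ode` with the same input reproduces the same output exactly, which is the "consistently correct guidance" property the abstract contrasts with stochastic SDE sampling.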
RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D
Qiu, Lingteng, Chen, Guanying, Gu, Xiaodong, Zuo, Qi, Xu, Mutian, Wu, Yushuang, Yuan, Weihao, Dong, Zilong, Bo, Liefeng, Han, Xiaoguang
Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normal maps, leading to instability in optimization. In this paper, recognizing that normal and depth information effectively describes scene geometry and can be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with the generalizable image-to-depth and normal prior models. In an attempt to alleviate the mixed illumination effects in the generated materials, we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines, our models significantly enhance the detail richness, achieving state-of-the-art results. Our project page is https://aigc3d.github.io/richdreamer/.
Text-to-3D with Classifier Score Distillation
Yu, Xin, Guo, Yuan-Chen, Li, Yangguang, Liang, Ding, Zhang, Song-Hai, Qi, Xiaojuan
Creating a high-quality 3D asset remains challenging and expensive, as it requires a high level of expertise. Automating this process with generative models has therefore become an important problem, which remains challenging due to the scarcity of data and the complexity of 3D representations. Recently, techniques based on Score Distillation Sampling (SDS) (Poole et al., 2022; Lin et al., 2023; Chen et al., 2023; Wang et al., 2023b), also known as Score Jacobian Chaining (SJC) (Wang et al., 2023a), have emerged as a major research direction for text-to-3D generation, as they can produce high-quality and intricate 3D results from diverse text prompts without requiring 3D data for training. The core principle behind SDS is to optimize 3D representations by encouraging their rendered images to move towards high probability density regions conditioned on the text, where the supervision is provided by a pre-trained 2D diffusion model (Ho et al., 2020; Sohl-Dickstein et al., 2015; Rombach et al., 2022; Saharia et al., 2022; Balaji et al., 2022). DreamFusion (Poole et al., 2022) advocates the use of SDS for the optimization of Neural Radiance Fields (NeRF).
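The SDS principle just described can be sketched as a minimal optimization loop. Everything here is a toy stand-in: `render` replaces differentiable volume rendering of a NeRF, `denoiser` replaces the frozen pretrained text-conditioned diffusion model, and `target` plays the role of a high-density region of the image prior.

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.zeros(2)                  # stand-in for NeRF parameters
target = np.array([1.5, -0.5])       # toy "high-density region" of the prior

def render(theta):
    # Identity renderer (assumption); in practice, a differentiable
    # volume rendering of the 3D representation from a random camera.
    return theta

def denoiser(x_t, sigma):
    # Toy noise predictor whose output pulls x_t toward `target`;
    # stands in for the frozen pretrained 2D diffusion model.
    return (x_t - target) * sigma / (1.0 + sigma**2)

def sds_step(theta, lr=0.05, sigma=1.0):
    x = render(theta)
    eps = rng.standard_normal(x.shape)
    x_t = x + sigma * eps                  # diffuse the rendering
    grad = denoiser(x_t, sigma) - eps      # SDS skips the denoiser Jacobian
    return theta - lr * grad               # chain rule through render only

for _ in range(3000):
    theta = sds_step(theta)
# theta drifts toward the high-density region around `target`.
```

The key structural point is in `sds_step`: the gradient is the residual between the predicted and injected noise, backpropagated through the renderer only, not through the diffusion model.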
IT3D: Improved Text-to-3D Generation with Explicit View Synthesis
Chen, Yiwen, Zhang, Chi, Yang, Xiaofeng, Cai, Zhongang, Yu, Gang, Yang, Lei, Lin, Guosheng
Recent strides in Text-to-3D techniques have been propelled by distilling knowledge from powerful large text-to-image diffusion models (LDMs). Nonetheless, existing Text-to-3D approaches often grapple with challenges such as over-saturation, inadequate detailing, and unrealistic outputs. This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues. Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images based on the renderings of coarse 3D models. Although the generated images mostly alleviate the aforementioned issues, challenges such as view inconsistency and significant content variance persist due to the inherent generative nature of large diffusion models, posing extensive difficulties in leveraging these images effectively. To overcome this hurdle, we advocate integrating a discriminator alongside a novel Diffusion-GAN dual training strategy to guide the training of 3D models. For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data. We conduct a comprehensive set of experiments that demonstrate the effectiveness of our method over baseline approaches.
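The Diffusion-GAN labeling described above (LDM-synthesized multi-view images as real, renderings of the 3D model as fake) can be sketched with a minimal logistic discriminator on feature vectors; the names, shapes, and constants here are illustrative assumptions, not IT3D's actual architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_loss(d, real_feats, fake_feats):
    # Binary cross-entropy: LDM-synthesized multi-view images are labeled
    # real, renderings of the current 3D model are labeled fake.
    p_real = sigmoid(real_feats @ d)
    p_fake = sigmoid(fake_feats @ d)
    return (-np.mean(np.log(p_real + 1e-8))
            - np.mean(np.log(1.0 - p_fake + 1e-8)))

def g_adv_loss(d, fake_feats):
    # Non-saturating generator term: push renders to look "real" to the
    # discriminator, used alongside the distillation objective in the
    # dual training loop.
    return -np.mean(np.log(sigmoid(fake_feats + 0.0 @ d if False else fake_feats @ d) + 1e-8))

# A discriminator aligned with the real/fake separation scores a lower
# loss than an uninformative one.
real = np.ones((4, 1))
fake = -np.ones((4, 1))
aligned = np.array([5.0])
blind = np.array([0.0])
```

Alternating minimization of `d_loss` over the discriminator and `g_adv_loss` over the 3D model's renders gives the dual training dynamic the abstract describes.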
Magic3D: High-Resolution Text-to-3D Content Creation
Lin, Chen-Hsuan, Gao, Jun, Tang, Luming, Takikawa, Towaki, Zeng, Xiaohui, Huang, Xun, Kreis, Karsten, Fidler, Sanja, Liu, Ming-Yu, Lin, Tsung-Yi
DreamFusion has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: (a) extremely slow optimization of NeRF and (b) low-resolution image space supervision on NeRF, leading to low-quality 3D models with a long processing time. In this paper, we address these limitations by utilizing a two-stage optimization framework. First, we obtain a coarse model using a low-resolution diffusion prior, accelerated with a sparse 3D hash grid structure. Using the coarse representation as the initialization, we further optimize a textured 3D mesh model with an efficient differentiable renderer interacting with a high-resolution latent diffusion model. Our method, dubbed Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2x faster than DreamFusion (reportedly taking 1.5 hours on average), while also achieving higher resolution. User studies show that 61.7% of raters prefer our approach over DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues to various creative applications.
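The two-stage coarse-to-fine strategy can be caricatured in a few lines. This is a deliberately simplified numeric sketch, not Magic3D's pipeline: fit a small low-resolution field cheaply first, then use its upsampled result to initialize full-resolution refinement.

```python
import numpy as np

target = np.sin(np.linspace(0.0, np.pi, 64))   # stand-in for the "scene"

def fit(x, tgt, steps, lr=0.5):
    # Gradient descent on 0.5 * ||x - tgt||^2 at the given resolution.
    for _ in range(steps):
        x = x - lr * (x - tgt)
    return x

coarse = fit(np.zeros(16), target[::4], steps=50)   # stage 1: cheap, low-res
fine_init = np.repeat(coarse, 4)                    # upsample as initialization
fine = fit(fine_init, target, steps=50)             # stage 2: high-res refine
```

The design point mirrored here is that stage 2 starts from a good initialization rather than from scratch, so the expensive high-resolution phase does less work.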